29.6 Word Embeddings

Hello everybody and welcome to the last video nugget on natural language processing.

Today's topic is word embeddings.

Word embedding is a technique that's quite useful in a number of applications, and we've already seen one instance of it.

So the upshot is basically that NLP techniques deal with words and documents and such things, but we need numerical data.

So a word embedding is basically a mapping from words in a context, usually a document context, into a real-valued vector space R^n, and this mapping is used for language processing.

We've seen vectors which we call one-hot vectors, where all components are zero except for one, which is one.

And we'll call an embedding one-hot if all of its vectors are one-hot.

So we've already seen one-hot word embeddings in information retrieval.

And the word frequency vectors used there are obtained by adding up one-hot word embeddings.
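Just to make that concrete, here is a minimal sketch in Python with a made-up toy vocabulary and document:

```python
import numpy as np

# Toy vocabulary; in practice this comes from the document collection.
vocab = ["the", "cat", "sat", "on", "mat"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # One-hot embedding: all components are zero except the one for this word.
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

# The word frequency vector of a document is the sum of the one-hot
# embeddings of its words.
document = ["the", "cat", "sat", "on", "the", "mat"]
freq_vector = sum(one_hot(w) for w in document)
print(freq_vector)  # [2. 1. 1. 1. 1.] -- "the" occurs twice
```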

And we've seen another example, the TF-IDF word embedding, which is basically given by the TF-IDF scores of the terms of a document within a document collection.

So the intuition behind those is that words that occur in similar documents are similar.
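As a small sketch of how such a score can be computed (there are several TF-IDF variants; this is just one simple formulation, with made-up example documents):

```python
import math
from collections import Counter

def tf_idf(term, document, collection):
    # Term frequency: how often the term occurs in this document.
    tf = Counter(document)[term]
    # Inverse document frequency: terms that are rare in the collection get a higher weight.
    docs_with_term = sum(1 for d in collection if term in d)
    idf = math.log(len(collection) / (1 + docs_with_term))
    return tf * idf

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["a", "cat", "ran"]]
print(tf_idf("sat", docs[0], docs))  # occurs in only one document, so it gets some weight
print(tf_idf("the", docs[0], docs))  # occurs in most documents, so its weight is near zero
```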

We can indeed do more with word embeddings and that's what I want to go into now.

One of the early but extremely popular word embeddings is Word2Vec, something you can just download; it is very easy to obtain.

So the idea, the linguistics behind this, is what is often called distributional semantics, and the slogan is: a word is characterized by the company it keeps.

And the idea is that you can tell the meaning of a word essentially by looking at its context in a document. For instance, if you look at the words that can come after the word "loves", then the objects that can be loved will be similar; and the words that come before "loves", the subjects of a loving relationship, are also similar, just by being near the word "loves".

And so this idea that a word is characterized by its context gives rise to word embeddings that take the context into account.
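To illustrate the "company it keeps" intuition, here is a minimal sketch that simply counts which words occur within a small window around each word. Note that Word2Vec itself learns its vectors with a shallow neural network rather than by raw counting, so this is only meant to show what "context" means here:

```python
from collections import defaultdict, Counter

def context_counts(tokens, window=2):
    # For each word, count the words that occur within `window` positions of it.
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[word][tokens[j]] += 1
    return counts

tokens = "mary loves john and john loves pizza".split()
print(context_counts(tokens)["loves"])
# The words near "loves" (its subjects and objects) characterize its meaning.
```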

And what comes out of this idea, just to give you a preview, is what are called semantic word embeddings, and they have this very nice property that vector differences can be used to encode word relations.

So for instance, if you take the vector for king and subtract from it the vector for queen, then the difference vector, which I've marked here, is essentially the same as the difference vector between man and woman.

So essentially the male-female relation of words is expressed in Word2Vec as a difference vector, which can be used very easily.

So if you know the vectors for man, woman, and king, then you can make a prediction of the vector for the word that is to king as woman is to man, in this case, queen.

And that works quite nicely.
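You can try this yourself with a pretrained Word2Vec model; here is a sketch using the gensim library, assuming you have downloaded a pretrained vector file such as the commonly distributed GoogleNews binary (adjust the file name to your download):

```python
from gensim.models import KeyedVectors

# Load pretrained Word2Vec vectors (assumed to be downloaded locally).
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# king - man + woman: the male-female relation as a difference vector.
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Typically returns something like [('queen', 0.71)]
```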

The same thing you can do for verb tense, walking and walked, swimming and swam, but also, if you have enough data, for things like country and capital.

If you want to find out just empirically what the country-capital relation is, just subtract the two word vectors.

So you always have the same, or about the same, word vector difference for Spain and Madrid, for Italy and Rome, and so on.
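Again as a sketch, assuming the kv object with pretrained vectors from the snippet above, you can check directly that the two difference vectors point in roughly the same direction:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# If the country-capital relation is a difference vector, then
# Spain - Madrid should be close in direction to Italy - Rome.
spain_madrid = kv["Spain"] - kv["Madrid"]
italy_rome = kv["Italy"] - kv["Rome"]
print(cosine(spain_madrid, italy_rome))  # clearly positive for pretrained vectors
```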

So we've kind of added something to the idea from before, namely that words that occur in similar contexts are similar.
